Analysis on the Movie Lens dataset using pandas

I am creating the notebook for the mini project for course DSE200x - Python for Data Science on edX. The project requires each participant to complete the following steps:

Selecting a dataset

Exploring the dataset to identify what kinds of questions can be answered using the dataset

Identifying one research question

Using pandas methods to explore the dataset - this also involves using visualization techniques using matplotlib

Reporting findings/analyses

Presenting the work in the given presentation template

Selecting a dataset

The mini projects requires us to choose from among three datasets that have been explored through the course previously. I have selected the movie lens dataset, also known as the IMDB Movie Dataset.

The dataset is available for download here - https://grouplens.org/datasets/movielens/20m/

Description about the dataset, as shown on the website is below:

This dataset (ml-20m) describes 5-star rating and free-text tagging activity from MovieLens, a movie recommendation service. It contains 20000263 ratings and 465564 tag applications across 27278 movies. These data were created by 138493 users between January 09, 1995 and March 31, 2015. This dataset was generated on October 17, 2016.

Users were selected at random for inclusion. All selected users had rated at least 20 movies. No demographic information is included. Each user is represented by an id, and no other information is provided.

The data are contained in six files, genome-scores.csv, genome-tags.csv, links.csv, movies.csv, ratings.csv and tags.csv. More details about the contents and use of all these files follows.

This and other GroupLens data sets are publicly available for download at http://grouplens.org/datasets/.



In [199]:

    
# The first step is to import the dataset into a pandas dataframe. 

import pandas as pd

#path = 'C:/Users/hrao/Documents/Personal/HK/Python/ml-20m/ml-20m/'
path = '/Users/Harish/Documents/HK_Work/Python/ml-20m/'

movies = pd.read_csv(path+'movies.csv')
movies.shape









    Out[199]:





(27278, 3)



In [200]:

    
tags = pd.read_csv(path+'tags.csv')
tags.shape









    Out[200]:





(465564, 4)



In [201]:

    
ratings = pd.read_csv(path+'ratings.csv')
ratings.shape









    Out[201]:





(20000263, 4)



In [202]:

    
links = pd.read_csv(path+'links.csv')
links.shape









    Out[202]:





(27278, 3)

Exploring the dataset

Identifying the questions that can be answered using the dataset



In [203]:

    
movies.head()









    Out[203]:






  
    
      
      movieId
      title
      genres
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy



In [204]:

    
tags.head()









    Out[204]:






  
    
      
      userId
      movieId
      tag
      timestamp
    
  
  
    
      0
      18
      4141
      Mark Waters
      1240597180
    
    
      1
      65
      208
      dark hero
      1368150078
    
    
      2
      65
      353
      dark hero
      1368150079
    
    
      3
      65
      521
      noir thriller
      1368149983
    
    
      4
      65
      592
      dark hero
      1368150078



In [205]:

    
ratings.head()



In [206]:

    
links.head()

Based on the above exploratory commands, I believe that the following questions can be answered using the dataset:

Is there a correlation or a trend between the year of release of a movie and the genre?

Which genres were more dominant in each decade of the range available in the dataset?

Do science fiction movies tend to be rated more highly than other movie genres?

For the mini-project, I have chosen question 3 for further analysis.

Using pandas methods to explore the dataset

Includes matplotlib visualization



In [207]:

    
# List of genres as a Python list 

genres = ['Action','Adventure','Animation','Children','Comedy','Crime','Documentary','Drama','Fantasy','Film-Noir','Horror','Musical','Mystery','Romance','Sci-Fi','Thriller','War','Western']



In [208]:

    
genres_rating_list = []



In [209]:

    
# The loop reads each element of the above list
    # For each iteration, one genre is selected from the movies data frame
    # This selection of the data frame is then merged with the rating data frame to get the rating for that genre
    # Once the new merged data frame is created, we use the mean function to get the mean rating for the genre
    # The genre and the corresponding mean rating are then appended to the genres_rating Data Frame
    # The entire looping takes long - can certainly be optimized for performance
    
for i in range(len(genres)):
    fil = genres[i]+'_filter'
    mov = genres[i]+'_movies'
    rat = genres[i]+'_ratings'
    rat_mean = rat+'_mean'
    fil = movies['genres'].str.contains(genres[i])
    mov = movies[fil]
    rat = mov.merge(ratings, on='movieId', how='inner')
    rat_mean = round(rat['rating'].mean(), 2)
    #print(genres[i], round(rat_mean,2))
    genres_rating_list.append(rat_mean)



In [210]:

    
df = {'Genre':genres, 'Genres Mean Rating':genres_rating_list}



In [211]:

    
genres_rating = pd.DataFrame(df)



In [212]:

    
genres_rating









    Out[212]:






  
    
      
      Genre
      Genres Mean Rating
    
  
  
    
      0
      Action
      3.44
    
    
      1
      Adventure
      3.50
    
    
      2
      Animation
      3.62
    
    
      3
      Children
      3.41
    
    
      4
      Comedy
      3.43
    
    
      5
      Crime
      3.67
    
    
      6
      Documentary
      3.74
    
    
      7
      Drama
      3.67
    
    
      8
      Fantasy
      3.51
    
    
      9
      Film-Noir
      3.97
    
    
      10
      Horror
      3.28
    
    
      11
      Musical
      3.56
    
    
      12
      Mystery
      3.66
    
    
      13
      Romance
      3.54
    
    
      14
      Sci-Fi
      3.44
    
    
      15
      Thriller
      3.51
    
    
      16
      War
      3.81
    
    
      17
      Western
      3.57



In [213]:

    
genres_rating['Genres Standard Deviation'] = genres_rating['Genres Mean Rating'].std()



In [214]:

    
genres_rating['Mean'] = genres_rating['Genres Mean Rating'].mean()
genres_rating['Zero'] = 0



In [215]:

    
genres_rating









    Out[215]:






  
    
      
      Genre
      Genres Mean Rating
      Genres Standard Deviation
      Mean
      Zero
    
  
  
    
      0
      Action
      3.44
      0.163244
      3.573889
      0
    
    
      1
      Adventure
      3.50
      0.163244
      3.573889
      0
    
    
      2
      Animation
      3.62
      0.163244
      3.573889
      0
    
    
      3
      Children
      3.41
      0.163244
      3.573889
      0
    
    
      4
      Comedy
      3.43
      0.163244
      3.573889
      0
    
    
      5
      Crime
      3.67
      0.163244
      3.573889
      0
    
    
      6
      Documentary
      3.74
      0.163244
      3.573889
      0
    
    
      7
      Drama
      3.67
      0.163244
      3.573889
      0
    
    
      8
      Fantasy
      3.51
      0.163244
      3.573889
      0
    
    
      9
      Film-Noir
      3.97
      0.163244
      3.573889
      0
    
    
      10
      Horror
      3.28
      0.163244
      3.573889
      0
    
    
      11
      Musical
      3.56
      0.163244
      3.573889
      0
    
    
      12
      Mystery
      3.66
      0.163244
      3.573889
      0
    
    
      13
      Romance
      3.54
      0.163244
      3.573889
      0
    
    
      14
      Sci-Fi
      3.44
      0.163244
      3.573889
      0
    
    
      15
      Thriller
      3.51
      0.163244
      3.573889
      0
    
    
      16
      War
      3.81
      0.163244
      3.573889
      0
    
    
      17
      Western
      3.57
      0.163244
      3.573889
      0



In [216]:

    
overall_mean = round(genres_rating['Genres Mean Rating'].mean(), 2)
overall_std = round(genres_rating['Genres Mean Rating'].std(),2)
scifi_rating = genres_rating[genres_rating['Genre'] == 'Sci-Fi']['Genres Mean Rating']



In [217]:

    
print(overall_mean)



In [218]:

    
print(overall_std)



In [219]:

    
print(scifi_rating)









    



14    3.44
Name: Genres Mean Rating, dtype: float64



In [220]:

    
genres_rating['Diff from Mean'] = genres_rating['Genres Mean Rating'] - overall_mean



In [221]:

    
genres_rating









    Out[221]:






  
    
      
      Genre
      Genres Mean Rating
      Genres Standard Deviation
      Mean
      Zero
      Diff from Mean
    
  
  
    
      0
      Action
      3.44
      0.163244
      3.573889
      0
      -0.13
    
    
      1
      Adventure
      3.50
      0.163244
      3.573889
      0
      -0.07
    
    
      2
      Animation
      3.62
      0.163244
      3.573889
      0
      0.05
    
    
      3
      Children
      3.41
      0.163244
      3.573889
      0
      -0.16
    
    
      4
      Comedy
      3.43
      0.163244
      3.573889
      0
      -0.14
    
    
      5
      Crime
      3.67
      0.163244
      3.573889
      0
      0.10
    
    
      6
      Documentary
      3.74
      0.163244
      3.573889
      0
      0.17
    
    
      7
      Drama
      3.67
      0.163244
      3.573889
      0
      0.10
    
    
      8
      Fantasy
      3.51
      0.163244
      3.573889
      0
      -0.06
    
    
      9
      Film-Noir
      3.97
      0.163244
      3.573889
      0
      0.40
    
    
      10
      Horror
      3.28
      0.163244
      3.573889
      0
      -0.29
    
    
      11
      Musical
      3.56
      0.163244
      3.573889
      0
      -0.01
    
    
      12
      Mystery
      3.66
      0.163244
      3.573889
      0
      0.09
    
    
      13
      Romance
      3.54
      0.163244
      3.573889
      0
      -0.03
    
    
      14
      Sci-Fi
      3.44
      0.163244
      3.573889
      0
      -0.13
    
    
      15
      Thriller
      3.51
      0.163244
      3.573889
      0
      -0.06
    
    
      16
      War
      3.81
      0.163244
      3.573889
      0
      0.24
    
    
      17
      Western
      3.57
      0.163244
      3.573889
      0
      0.00

Now that we have a data frame of information about each genre and the corresponding mean rating, we will visualize the data using matplotlib



In [222]:

    
genre_list = list(genres_rating['Genre'])



In [223]:

    
genres_rating_list = list(genres_rating['Genres Mean Rating'])
genres_diff_list = list(genres_rating['Diff from Mean'])



In [224]:

    
%matplotlib inline
import matplotlib.pyplot as plt

plt.figure(figsize=(20, 10))

ax1 = plt.subplot(2,1,1)
x = [x for x in range(0, 18)]
xticks_genre_list = genre_list
y = genres_rating_list
plt.xticks(range(len(x)), xticks_genre_list)
plt.scatter(x,y, color='g')
plt.plot(x, genres_rating['Mean'], color="red")
plt.autoscale(tight=True)
#plt.rcParams["figure.figsize"] = (10,2)
plt.title('Movie ratings by genre')
plt.xlabel('Genre')
plt.ylabel('Rating')
plt.ylim(ymax = 4, ymin = 3)
plt.grid(True)
plt.savefig(r'movie-ratings-by-genre.png')

plt.annotate("Sci-Fi Rating",
            xy=(14.25,3.5), xycoords='data',
            xytext=(14.20, 3.7), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )

for i,j in enumerate( y ):
    ax1.annotate( j, ( x[i] + 0.03, y[i] + 0.02))

ax2 = plt.subplot(2,1,2)
x = [x for x in range(0, 18)]
xticks_genre_list = genre_list
y = genres_rating['Diff from Mean']
plt.xticks(range(len(x)), xticks_genre_list)
plt.plot(x,y)
plt.plot(x, genres_rating['Zero'])
plt.autoscale(tight=True)
#plt.rcParams["figure.figsize"] = (10,2)
plt.title('Deviation of each genre\'s rating from the overall mean rating')
plt.xlabel('Genre')
plt.ylabel('Deviation from mean rating')
plt.grid(True)
plt.savefig(r'deviation-from-mean-rating.png')

plt.annotate("Sci-Fi Rating",
            xy=(14,-0.13), xycoords='data',
            xytext=(14.00, 0.0), textcoords='data',
            arrowprops=dict(arrowstyle="->",
                            connectionstyle="arc3"),
            )


plt.show()

Reporting findings/analyses

Now that we have a couple plots, let us revisit the question we want to answer using the dataset.

Again, the question is - Do science fiction movies tend to be rated more highly than other movie genres?

The scatter plot shows the mean rating value for each genre. Each genre has a value on the scatter plot for the mean rating value for that genre. Let us now see if the plot is able to help us answer the question above.

The mean rating for Sci-Fi genre is about 3.45. When looking at the plot, we see that there are only three other genres out of 18 genres in total, that have lesser mean ratings than Sci-Fi - Horror, Children and Comedy. The remaining 10 genres have mean ratings higher than Science Fiction.

This gives us enough information to answer the question. Sci-Fi movies do not tend to be rated higher than other genres.

The second plot, a bar plot, shows how much each genre's ratings deviate from the overall mean of ratings. Science Fiction is around -0.13 lower than the mean rating of 3.58, showing lesser deviation than Horror at the lower end and Film-Noir at the higher end.

To conclude - no, science fiction movies are not rated higher than other movie genres. The ratings for science fiction movies hover around the mean ratings for all movies.

I have submitted my work to the mini project section of the course. Now, we will explore the dataset further and try to answer the remaining questions I have listed at the beginning of the notebook.

- Is there a correlation or a trend between the year of release of a movie and the genre?
- Which genres were more dominant in each decade of the range available in the dataset?



In [225]:

    
# extract year of release of each movie from the title column
# convert the data type of the movie_year column to numeric (from str)
import numpy as np
import re 
#movies['rel_year'] = movies.title.str[-5:-1]
#movies['rel_year'] = movies.title.apply(lambda x: x[-5:-1])
movies['movie_year'] = movies['title']
#movies['movie_year'] = movies['movie_year'].apply(lambda x: re.findall('\((.*?)\)',x))
movies['movie_year'] = movies['movie_year'].str.extract(r"\(([0-9]+)\)", expand=False)


# creating a new column with just the movie titles
movies['title_only'] = movies['title']
movies['title_only'] = movies['title_only'].str.extract('(.*?)\s*\(', expand=False)



In [234]:

    
#Drop all rows containing incorrect year values - such as 0, 6, 69, 500 and -2147483648
movies.drop(movies[movies.movie_year == '0'].index, inplace=True)
movies.drop(movies[movies.movie_year == '6'].index, inplace=True)
movies.drop(movies[movies.movie_year == '06'].index, inplace=True)
movies.drop(movies[movies.movie_year == '69'].index, inplace=True)
movies.drop(movies[movies.movie_year == '500'].index, inplace=True)
movies.drop(movies[movies.movie_year == '-2147483648'].index, inplace=True)



In [235]:

    
movies['movie_year'].fillna(0, inplace=True)



In [236]:

    
#convert the string values to numeric
movies['movie_year'] = pd.to_datetime(movies['movie_year'], format='%Y')

Now that we have a move year column, let us list the data types of the columns in the movies data frame.



In [237]:

    
movies.dtypes









    Out[237]:





movieId                int64
title                 object
genres                object
movie_year    datetime64[ns]
title_only            object
dtype: object

movie_year is of float64 datat type. We must convert the data type of the movie_year column to int64. Before we go ahead and do that, we must replace all NULL and inifinite entries in the column with zero. If we do not perform this step, we will get the following errror message.



In [188]:

    
#movies['movie_year'].astype(np.int64)



In [190]:

    
#movies['movie_year'] = movies['movie_year'].astype(np.int64, inplace=True)



In [191]:

    
movies.dtypes









    Out[191]:





movieId        int64
title         object
genres        object
movie_year     int64
title_only    object
dtype: object



In [253]:

    
movies.describe()









    Out[253]:






  
    
      
      movieId
    
  
  
    
      count
      27253.000000
    
    
      mean
      59811.511063
    
    
      std
      44416.171467
    
    
      min
      1.000000
    
    
      25%
      6926.000000
    
    
      50%
      67949.000000
    
    
      75%
      100255.000000
    
    
      max
      131262.000000



In [254]:

    
movies.head()









    Out[254]:






  
    
      
      movieId
      title
      genres
      movie_year
      title_only
    
  
  
    
      0
      1
      Toy Story (1995)
      Adventure|Animation|Children|Comedy|Fantasy
      1995-01-01
      Toy Story
    
    
      1
      2
      Jumanji (1995)
      Adventure|Children|Fantasy
      1995-01-01
      Jumanji
    
    
      2
      3
      Grumpier Old Men (1995)
      Comedy|Romance
      1995-01-01
      Grumpier Old Men
    
    
      3
      4
      Waiting to Exhale (1995)
      Comedy|Drama|Romance
      1995-01-01
      Waiting to Exhale
    
    
      4
      5
      Father of the Bride Part II (1995)
      Comedy
      1995-01-01
      Father of the Bride Part II



In [290]:

    
movies_and_years = pd.DataFrame(movies['movie_year', 'title_only'])









    



---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   2133             try:
-> 2134                 return self._engine.get_loc(key)
   2135             except KeyError:

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: ('movie_year', 'title_only')

During handling of the above exception, another exception occurred:

KeyError                                  Traceback (most recent call last)
<ipython-input-290-479897ee2471> in <module>()
----> 1 movies_and_years = pd.DataFrame(movies['movie_year', 'title_only'])

/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in __getitem__(self, key)
   2057             return self._getitem_multilevel(key)
   2058         else:
-> 2059             return self._getitem_column(key)
   2060 
   2061     def _getitem_column(self, key):

/anaconda/lib/python3.6/site-packages/pandas/core/frame.py in _getitem_column(self, key)
   2064         # get column
   2065         if self.columns.is_unique:
-> 2066             return self._get_item_cache(key)
   2067 
   2068         # duplicate columns & possible reduce dimensionality

/anaconda/lib/python3.6/site-packages/pandas/core/generic.py in _get_item_cache(self, item)
   1384         res = cache.get(item)
   1385         if res is None:
-> 1386             values = self._data.get(item)
   1387             res = self._box_item_values(item, values)
   1388             cache[item] = res

/anaconda/lib/python3.6/site-packages/pandas/core/internals.py in get(self, item, fastpath)
   3541 
   3542             if not isnull(item):
-> 3543                 loc = self.items.get_loc(item)
   3544             else:
   3545                 indexer = np.arange(len(self.items))[isnull(self.items)]

/anaconda/lib/python3.6/site-packages/pandas/indexes/base.py in get_loc(self, key, method, tolerance)
   2134                 return self._engine.get_loc(key)
   2135             except KeyError:
-> 2136                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   2137 
   2138         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4433)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas/index.c:4279)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13742)()

pandas/src/hashtable_class_helper.pxi in pandas.hashtable.PyObjectHashTable.get_item (pandas/hashtable.c:13696)()

KeyError: ('movie_year', 'title_only')



In [289]:

    
movies_and_years.tail()









    Out[289]:






  
    
      
      movie_year
    
    
      title_only
      
    
  
  
    
      Kein Bund für's Leben
      NaT
    
    
      Feuer, Eis & Dosenbier
      NaT
    
    
      The Pirates
      NaT
    
    
      Rentun Ruusu
      NaT
    
    
      Innocence
      NaT



In [284]:

    
#a = pd.DataFrame(movies['title_only'].groupby(movies['movie_year']).count(), movies['movie_year'], movies['title_only'])



In [285]:

    
#plt.plot(a['title_only'], a['movie_year'])
#plt.show()



In [250]:

    
#movies_and_years



In [251]:

    
#movies_and_years['title_only'].groupby[movies_and_years['movie_year']]



In [ ]:

    
#create a unique array of years from the newly created data frame column 
#years_array_raw = movies['movie_year'].unique()
#years_array_raw  = years_array_raw.astype(int)



In [ ]:

    
#years_array_raw

In the above array, we see some incorrect year values. And the year values need to be sorted as well.



In [ ]:

    
# We see some elements of the array that are not year values - such as  -2147483648 and 6
#Let us create a new numpy array to clean up the data and sort the list of unique years



In [ ]:

    
# Numpy arrays are immutable. Delete operations on the array will create a new copy of the same array. So, 
#years_array_clean = np.array([])



In [ ]:

    
#for i in range(len(years_array_raw)):
#    if years_array_raw[i] <= 2015 and years_array_raw[i] >= 1800:
 #       years_array_clean = np.append(years_array_clean, years_array_raw[i])
 #   else:
   #     print(years_array_raw[i], ' - Incorrect year value. Will not be appended to the new array')



In [ ]:

    
#the incorrect year values are not present in the array anymore
#years_array_clean



In [ ]:

    
#sorting the years in ascending order
#years_array_clean = np.sort(years_array_clean)



In [ ]:

    
#Now, we have a unique, clean and sorted year list
#years_array_clean



In [245]:

    
# years and their respective movie counts
#movies_and_years = pd.DataFrame(movies['movie_year'],(movies['title_only'].groupby(movies['movie_year']).agg('count')))
#movies_and_years1 = pd.DataFrame(movies['movie_year'],(movies['title_only'].groupby(years_array_clean)))



In [ ]:

    
# Movies and their corresponding genre combinations
# This is not the ideal grouping I want to arrive at - I want to analyze movies by individual genres, not groups of genres such as Comedy|Drama|Romance, for example.
# That will be the next step, possibly by creating new columns with flags for each genre
#movies_and_genres = pd.DataFrame(movies['genres'], (movies['title_only'].groupby(movies['genres']).agg('count')))

	movieId	imdbId	tmdbId
0	1	114709	862.0
1	2	113497	8844.0
2	3	113228	15602.0
3	4	114885	31357.0
4	5	113041	11862.0

	movieId	title	genres
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy
2	3	Grumpier Old Men (1995)	Comedy\|Romance
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance
4	5	Father of the Bride Part II (1995)	Comedy

	userId	movieId	tag	timestamp
0	18	4141	Mark Waters	1240597180
1	65	208	dark hero	1368150078
2	65	353	dark hero	1368150079
3	65	521	noir thriller	1368149983
4	65	592	dark hero	1368150078

	userId	movieId	rating	timestamp
0	1	2	3.5	1112486027
1	1	29	3.5	1112484676
2	1	32	3.5	1112484819
3	1	47	3.5	1112484727
4	1	50	3.5	1112484580

	Genre	Genres Mean Rating
0	Action	3.44
1	Adventure	3.50
2	Animation	3.62
3	Children	3.41
4	Comedy	3.43
5	Crime	3.67
6	Documentary	3.74
7	Drama	3.67
8	Fantasy	3.51
9	Film-Noir	3.97
10	Horror	3.28
11	Musical	3.56
12	Mystery	3.66
13	Romance	3.54
14	Sci-Fi	3.44
15	Thriller	3.51
16	War	3.81
17	Western	3.57

	Genre	Genres Mean Rating	Genres Standard Deviation	Mean
0	Action	3.44	0.163244	3.573889
1	Adventure	3.50	0.163244	3.573889
2	Animation	3.62	0.163244	3.573889
3	Children	3.41	0.163244	3.573889
4	Comedy	3.43	0.163244	3.573889
5	Crime	3.67	0.163244	3.573889
6	Documentary	3.74	0.163244	3.573889
7	Drama	3.67	0.163244	3.573889
8	Fantasy	3.51	0.163244	3.573889
9	Film-Noir	3.97	0.163244	3.573889
10	Horror	3.28	0.163244	3.573889
11	Musical	3.56	0.163244	3.573889
12	Mystery	3.66	0.163244	3.573889
13	Romance	3.54	0.163244	3.573889
14	Sci-Fi	3.44	0.163244	3.573889
15	Thriller	3.51	0.163244	3.573889
16	War	3.81	0.163244	3.573889
17	Western	3.57	0.163244	3.573889

	movieId
count	27253.000000
mean	59811.511063
std	44416.171467
min	1.000000
25%	6926.000000
50%	67949.000000
75%	100255.000000
max	131262.000000

	movieId	title	genres	movie_year	title_only
0	1	Toy Story (1995)	Adventure\|Animation\|Children\|Comedy\|Fantasy	1995-01-01	Toy Story
1	2	Jumanji (1995)	Adventure\|Children\|Fantasy	1995-01-01	Jumanji
2	3	Grumpier Old Men (1995)	Comedy\|Romance	1995-01-01	Grumpier Old Men
3	4	Waiting to Exhale (1995)	Comedy\|Drama\|Romance	1995-01-01	Waiting to Exhale
4	5	Father of the Bride Part II (1995)	Comedy	1995-01-01	Father of the Bride Part II

	movie_year
title_only
Kein Bund für's Leben	NaT
Feuer, Eis & Dosenbier	NaT
The Pirates	NaT
Rentun Ruusu	NaT
Innocence	NaT